Knowledge-based Wrapper Generation by Using XML
Abstract
Information extraction is the process of recognizing the particular fragments of a document that constitute its core semantic content. However, most previous information extraction systems were not effective for real-world information sources, because useful domain knowledge is difficult to acquire and represent and because different sources are structurally heterogeneous. To resolve these problems, this paper proposes a scheme for knowledge-based wrapper generation for semi-structured and labeled documents. The implementation of an agent-oriented information extraction system, XTROS, is described. In contrast with previous wrapper learning agents, XTROS represents both the domain knowledge and the wrappers as XML documents to increase modularity, flexibility, and interoperability among multiple parties. XTROS also simplifies the implementation of the wrapper generator by exploiting XML parsers and interpreters. XTROS shows good performance on several Web sites in the real estate domain, and it is expected to be easily adaptable to different domains by plugging in appropriate XML-based domain knowledge.
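The idea of keeping both the domain knowledge and the generated wrapper in XML can be illustrated with a minimal sketch. It is not the XTROS implementation: the element names (`knowledge`, `object`, `term`, `format`, `wrapper`, `rule`), the label-then-colon-then-value layout, and the helper functions are all assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical domain knowledge for real estate. The schema is an
# illustrative assumption, not the format used by XTROS.
DOMAIN_KNOWLEDGE_XML = r"""
<knowledge domain="real-estate">
  <object name="price">
    <term>Price</term>
    <format>\$[0-9]{1,3}(?:,[0-9]{3})*</format>
  </object>
  <object name="bedrooms">
    <term>Bedrooms</term>
    <term>BR</term>
    <format>[0-9]+</format>
  </object>
</knowledge>
"""


def load_knowledge(xml_text):
    """Parse the knowledge XML into {object name: (label terms, value regex)}."""
    root = ET.fromstring(xml_text.strip())
    knowledge = {}
    for obj in root.findall("object"):
        terms = [t.text for t in obj.findall("term")]
        knowledge[obj.get("name")] = (terms, obj.findtext("format"))
    return knowledge


def generate_wrapper(sample_text, knowledge):
    """Build an XML wrapper recording which label each object uses in the sample."""
    wrapper = ET.Element("wrapper")
    for name, (terms, fmt) in knowledge.items():
        for term in terms:
            # Assumed layout: label, optional spaces, colon, value.
            if re.search(re.escape(term) + r"\s*:\s*" + fmt, sample_text):
                rule = ET.SubElement(wrapper, "rule", {"object": name, "label": term})
                rule.text = fmt
                break
    return wrapper


def apply_wrapper(wrapper, text):
    """Extract one value per rule from a document of the same source."""
    result = {}
    for rule in wrapper.findall("rule"):
        pattern = re.escape(rule.get("label")) + r"\s*:\s*(" + rule.text + ")"
        match = re.search(pattern, text)
        if match:
            result[rule.get("object")] = match.group(1)
    return result


knowledge = load_knowledge(DOMAIN_KNOWLEDGE_XML)
wrapper = generate_wrapper("Price: $250,000 | BR: 3", knowledge)
print(ET.tostring(wrapper, encoding="unicode"))
print(apply_wrapper(wrapper, "Price: $1,199,500 | BR: 4"))
```

Because the wrapper is itself an XML element, it can be serialized, stored, and later applied by another party without rerunning the generator. Adapting the sketch to another domain would mean swapping in a different knowledge document rather than changing the code.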
Similar resources
Data Extraction using Content-Based Handles
In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...
XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or application...
An XML-enabled data extraction toolkit for web sources
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats are not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or applicat...
XML-Enabled Data Extraction for Web Sources
The amount of useful semi-structured data on the web continues to grow at a stunning pace. Often interesting web data are not in database systems but in HTML pages, XML pages, or text files. Data in these formats is not directly usable by standard SQL-like query processing engines that support sophisticated querying and reporting beyond keyword-based retrieval. Hence, the web users or application...
Java-COM integration with JACOB using XML wrappers
Many Windows-based legacy applications can be programmatically accessed using COM interfaces. However, calling COM components from Java is not straightforward. This report compares four open source Java-COM integration packages. A technique for typesafe Java-COM integration is presented. The technique is based on typesafe COM interface wrappers using jcom, java2com and JACOB libraries. Examples ...